Novel clinical, molecular and bioinformatics insights into the genetic background of autism

Background Clinical classification of autistic patients based on current WHO criteria provides a valuable but simplified depiction of the true nature of the disorder. Our goal is to determine the biology of the disorder and the ASD-associated genes that lead to differences in the severity and variability of clinical features, which can enhance the ability to predict clinical outcomes. Method Novel Whole Exome Sequencing data from children (n = 33) with ASD were collected along with extended cognitive and linguistic assessments. A machine learning methodology and a literature-based approach took into consideration known effects of genetic variation on the translated proteins, linking them with specific ASD clinical manifestations, namely non-verbal IQ, memory, attention and oral language deficits. Results Linear regression polygenic risk score results included the classification of severe and mild ASD samples with a 81.81% prediction accuracy. The literature-based approach revealed 14 genes present in all sub-phenotypes (independent of severity) and others which seem to impair individual ones, highlighting genetic profiles specific to mild and severe ASD, which concern non-verbal IQ, memory, attention and oral language skills. Conclusions These genes can potentially contribute toward a diagnostic gene-set for determining ASD severity. However, due to the limited number of patients in this study, our classification approach is mostly centered on the prediction and verification of these genes and does not hold a diagnostic nature per se. Substantial further experimentation is required to validate their role as diagnostic markers. The use of these genes as input for functional analysis highlights important biological processes and bridges the gap between genotype and phenotype in ASD.


Background
According to the Diagnostic and Statistical Manual of Mental Disorders [1], Autism Spectrum Disorder (ASD) is associated with abnormalities in early developmental period in communication and social interaction and with restricted and repetitive patterns of behavior or interests. Cognitive skills such as intelligence, memory and attention, as well as language skills may also be affected in ASD. Forty-four percent of children identified with ASD has average and above average intellectual quotient (IQ > 85), 25% has below average IQ (71-85) and 31% is within the range of intellectual disability (IQ < 70) [2,3]. Since ASD is a heterogeneous disorder, researchers usually adopt two main classifications: one based on the presence or absence of intellectual disability (ID) and one based on oral language skills (i.e., verbal or minimally verbal children). Talli et al. Human Genomics (2022) 16:39 The most common is the one that classifies the following two main subgroups: those that ASD coexists with intellectual disability and those that have average or above average intellectual functioning, whose characteristics vary in terms of linguistic, cognitive and social skills from those with intellectual disability [4]. Another classification for children within the autistic spectrum is verbal and non-verbal or "minimally verbal" children, i.e., children who have very limited use of spoken language for communication purposes. The reception of language might also be affected, and the autistic symptoms are usually severe in terms of behavior [5][6][7][8]. It has been commonly believed that non-verbal cognitive abilities predict expressive and receptive language [2,3]. However, Hanson et al. [9] have shown that there are minimally verbal children with autism who do not have low non-verbal IQ, others with low both expressive and receptive language skills and others that have low expressive but good receptive language skills. Consequently, categorization of subgroups in ASD is problematic.
Additionally, the association of genetic loci with specific behavioral characteristics in ASD contributes significantly to the understanding of the influence of genetic factors on clinical phenotype. This connection arises from studies that in their methodology include, in addition to genetic analysis, behavioral assessment, such as language and cognitive assessment. Recently, various researchers have suggested that genetics provide a lot of information on clinical phenotypes of ASD rather than vice versa [6]. Several chromosomal copy number variants (CNVs) and single-nucleotide variants (SNVs) (such as deletions and duplications at chromosomal regions 1q21, 7q11.23, 15q11-13, 16p11.2, and 22q11.2) have been identified as genetic risk factors for ASD [7,8] and have shown to have predictive value for clinical phenotype of ASD [5]. For example, 15q11.2 duplications are linked to ASD and Schizophrenia [10] in addition to their connection to high rates of epilepsy [11]. Other deletions, including 16p11.2, have been linked to cognitive deficits such as intellectual disability [9,12] as well as developmental coordination disorder, phonological processing disorder, expressive and receptive language disorders [13]. Prognostication of the clinical profiles of individuals with ASD based on specific genes that could serve as reliable biomarkers is important for early diagnosis and eventually for early and effective treatment. These findings would not be possible without contemporary sequencing and bioinformatics methods.
The technological advancements of next-generation sequencing (NGS), including whole genome sequencing (WGS) and whole exome sequencing (WES), have enabled researchers to perform detailed gene variation analyses like genome-wide association studies (GWAS) en masse. This newfound accessibility to these technologies enables not only experimental high-throughput protocols to be undertaken but also provides clinicians with powerful tools for assessing disease pathogenesis, progression and outcome. It has also enabled clinicians to provide more gene guided counseling into matters like therapy (through pharmacogenomics), and pre/peri-natal consulting. However, there are a variety of factors that need to be taken into account especially due to the complex nature of various diseases and the idiosyncrasies of individual patients regarding their genetic background. These notions bring forward precision medicine.
In precision medicine, genetic variation screening provides an important tool for detecting high-risk individuals of specific genetic disorders. Odds ratio analysis employed in traditional GWAS helps ascertain diseasevariant associations by the occurrence frequency of these high-risk variants in non-control groups. These variants can act both protectively and as instigators of disease. To make this determination, researchers can employ polygenic risk score predictions by training risk models on pools of variants highlighted in specific case-control studies [14,15]. Alternatively, variant annotation using in-silico approaches like GEMINI [16] provide information for each variant found in a study's samples through several genomic databases (ENCODE [17], UCSC [18], OMIM [19], dbSNP [20], KEGG [21], and HPRD [22]) and informs on frequency (like ExAC [23] and 1000GP [24]) and proteinic impact of changes in amino acid coding due to these variants (ClinVar [25], COSMIC [26], CADD [27], Polyphen [28] and SIFT [29]). Current WHO criteria for classification and grading of ASD provide a valuable but simplified depiction of the true nature of the disorder. Moreover, it is often difficult to predict clinical outcome using the current grading scheme. The aim of this study is to elucidate, through clinical assessment and bioinformatics, the differences in the genetic background of different phenotypical manifestations of ASD. More specifically, it aims at investigating whether there are specific genes that can account for differences in the clinical profiles of children with ASD at the linguistic and cognitive level by reporting on the analyses of a new autistic patient WES dataset (n = 33). We first extracted the sequenced genotypes (WES) of blood samples of school-aged (6-12-year-old) children with ASD. We then conducted clinical assessment by administrating standardized tests of non-verbal IQ, memory, attention and oral language skills and separated them into mild and severe phenotypes in each of these cognitive and linguistic categories, based on these assessments. The next step was to identify common high-risk variants in the sample dataset previously found in the literature by searching through several genomic databases but perhaps also to identify de novo variants, not previously reported in the literature. Finally, we used a linear regression polygenic risk score machine learning algorithm to obtain biologically significant genes with the potential to aid in the grading of autistic samples based on their sequenced genotypes, derive specific molecular signatures from severe and non-severe subtypes of autistic samples and assess whether these molecular signatures outline functional subclasses. At this stage we should stress that given the limited number of patients in the dataset used in our study, results require additional validation using further experimentation. With this in mind, we report eighty-four identified variants which could be assigned to specific functional categories related to ASD and intellectual disability, as well as other disorders. Classification of our samples using these variants was in agreement with the clinical classification for our dataset with 81.81% prediction accuracy. The six samples that showed a differential molecular diagnosis were further assessed using clinical information in order to substantiate the classification provided by our risk model.

Participants
Thirty-three children with ASD that attended both mainstream and special education schools were recruited from private speech therapy centers. Only those children whose parents gave written permission to participate in the research were included in the study. All children were diagnosed with ASD by public hospitals and public medical-pedagogical centers according to the ICD-10 (https:// apps. who. int/ iris/ handle/ 10665/ 37958) and DSM-V [1] official criteria. Children were initially divided into two groups based on their non-verbal IQ. The first included 18 children (average age: 9.5 years) with typical non-verbal IQ (> 80 in Raven Progressive Matrices) (ASD_MH group) and the second included 15 children (average age: 8.5 years) with low non-verbal IQ (< 60 in RPM) (ASD_L group). We then divided them based on whether they were verbal (acquired spoken language) or minimally verbal (absence of spoken language) children with ASD. The criterion was their performance (score 0) in two language tasks that required spoken language (see below expressive vocabulary and narration tasks). There were 19 verbal and 14 minimally verbal children. Moreover, we divided them in two groups (severe and mild) based on their attention and memory skills. Regarding the attention skills, the criterion for a child to fall under the severe phenotype was performance under the 10th percentile in both auditory and visual attention tasks and equal or over the 10th percentile for the mild phenotype. There were 9 children in the mild and 24 in the severe phenotype concerning attention skills. Regarding the memory skills, the criterion for a child to fall under the severe phenotype was performance under the 10th percentile in both auditory and visual memory tasks. There were 18 children in the mild and 15 in the severe phenotype.
Participants were assessed at their school individually in one or two sessions of a total duration of 45 min. Moreover, blood samples were obtained by experienced microbiologists in microbiology laboratories and were then sent to a genetic lab for Whole Exome Sequencing analysis.

Clinical phenotype assessment
In this study, children were assessed with cognitive as well language tasks. Standardized tests for Greek were employed. More specifically, our assessment materials included:

Cognitive measures
Non-verbal IQ. Non-verbal IQ was assessed with the Greek version of Raven Standard Progressive Matrices [30,31]. Both standard scores and percentiles were taken into consideration.
Auditory and visual attention. Auditory and visual attention was assessed using three subtests of the Test for the Assessment of Attention and Concentration [32] (i) Sustained auditory attention, (ii) Sustained visual attention and (iii) Range of visual attention. The Total Attention Score of all three auditory and visual attention subtests was also calculated.
Verbal short-term memory, visual and auditory memory. There were totally seven measures: VSTM Sentence recall [32], VSTM word recall [33], Immediate visual memory, Delayed visual memory, Visual information recall, Information retention factor (Story Recall subtest of the Memory Test; see Narration below) and Recognition.

Language measures
Expressive vocabulary. It was assessed using the Greek version of Crichton Vocabulary Scales [31]. It contains 80 word definitions, presented orally, and arranged in order of increasing difficulty (interruption criterion: four consecutive errors). Only one child from the ASD_L group was able to name a few definitions, so for all the rest of the children in the ASD_L group, Picture Naming and Comprehension Subscale was administered.
Picture Comprehension. It was administered only to the ASD_L group, because all but one were minimally verbal. Receptive vocabulary was assessed using Picture Comprehension Subscale (Detection of Speech and Language Disorders Test Preschool, [DSLD Test] [34]), in which the child was asked to point to the picture (among 4) that corresponded to the word presented orally by the examiner.
Narration. Narration was assessed by using the Story Recall subtest of the Memory Test [33]. The child would listen to two short stories and repeat them back right after the examiner and after a short break (scoring: total number of elements and total number of sections s/he remembered correctly).

Sequencing, mapping, alignment and variant calling
Exome enrichment library was prepared with the Agilent SureSelectXT Human All Exon V6 kit as per the manufacturer's instructions. Read files (Fastq) were generated from the sequencing platform (Illumina Hiseq). The samples were sequenced in paired end, 2 × 100 bp mode and deep coverage was obtained with approx. 6-7 Gb per sample (approx. 100 × av. coverage). Quality assessment and trimming was performed using the FastQC version: 0.11.7 and FASTX version: 0.0.14 toolkits, respectively. The Burrows-Wheeler Aligner (BWA) [35], version: 0.7.15 was used to map the raw reads to the human genome (build hg19/b37). Duplicate reads, which are likely to be the results of PCR bias, were marked using Picard (http:// broad insti tute. github. io/ picard/) version: 2.6.0. Samtools [36], version: 0.1.19, was used for additional BAM/SAM file manipulations. The Genome Analysis Tool Kit (GATK) [37], version 3.6.0, Haplotype Caller method was used for single-nucleotide polymorphism (SNP) and insertion/deletion (indel) variant calling.

Variant annotation
Variants were annotated with gene functional data from Ensembl version 90 using the Variant Effect Predictor (VEP) tool, version 90.6. Known variants were labeled using the dbSNP (Release 147) allowing for rapid identification of novel variants. Additional exploration of the results was performed using GEMINI, version 0.20.0, which provides a framework for analyzing, filtering and exploring genomic variation.

Odds ratio analysis
PLINK [38] was used to perform odds ratio analysis for obtaining variants with high disease association. PLINK allows for the detection of variants more frequently associated with severe than non-severe cases, which are labeled as high-risk or pro-severe variants. In contrast variants being more frequently associated in non-severe than severe cases are labeled as protective or pro-nonsevere variants.

Risk score prediction using linear logistic regression analysis and risk model construction
Linear logistic regression fitting was performed using the PredictABEL [39] package (available in R). Specifically risk models were constructed using the fitLogRegModel function, and the predRisk function was used to assess their performance and predict risks. Additional functions available by the package were used for the various measures to assess model performance. Commonly in genetic risk prediction studies this includes: (i) plotting receiver operating characteristic (ROC) curves and calculating area under the curve (AUC) values using the plotRoc function and (ii) the reclassification table construction, net reclassification improvement (NRI) and integrated discrimination improvement (IDI) calculations using the reclassification function. The NRI and IDI are important comparative measures that provide an assessment of how well a new model reclassifies the data [40]. Graphical representation of results were attained using the plotRisk-Distribution function for plotting risk distributions, the plotDiscriminationBox function for plotting discrimination box plots and the plotPredictivenessCurve function for plotting predictiveness curves. Better model performance was achieved by substituting the glm (generalized linear model) function utilized by PredictABEL with the bayesglm function available from the arm R package [41].

Machine learning data analysis
The samples obtained were separated into 2 classes based on their non-verbal IQ representing severe autism (n = 15) and non-severe autism (n = 18) and concurrently used as input for training the linear regression model classifier described above. Assessment of classifier performance was achieved using a leave-one-out crossvalidation (LOOCV) procedure. During each round of cross-validation, each sample was removed recursively and feature selection was performed on the remaining samples in the dataset, the model was then trained and utilized to classify the left-out sample. For the feature selection (variants), PLINK was used. To find the optimal set of variants, the leave-one-out cross-validation method was performed by testing initially the top six variants and sequentially increasing the number of variants for each run until classification accuracy reached a saturation point with no further improvement. LOOCV performance was assessed using prediction accuracy, sensitivity, specificity, and Matthew's correlation coefficient (MCC). MCC is defined as a balanced measurement of the classification quality which takes into account true and false positives and negatives. MCC returns values within the range of [− 1, 1]. The flowchart of the classification procedure is shown in Fig. 1.

Literature-based genomics approach
To complement our de novo classification approach, which predicts if a sample fits into the two main autism measurements under investigation (mild and severe), we utilized a novel pipeline for the identification of genes which characterize the IQ, verbal, memory and attention measurements based on current scientific knowledge. This pipeline consists of several distinct steps: Creating a subgroup of the variants in each of our samples which exhibit homozygosity to the alternate allele of our reference.
Keeping only the above variants for each sample, divide the samples into sub-phenotype groups based on our original phenotypical assessment measurements.
Running all samples of a sub-phenotypical group through R's SuperExactTest [42] package to identify genes common among at least n-2 samples (where n is the total number of samples in each category).
Finally, comparing the genes in each sample grouping via VENNY [43] to identify genes which characterize a group but also genes which are common among them and describe a more generic autism genetic signature. This approach is visualized in Fig. 2.

Validation gene dataset
To validate our results versus genes that are already known in literature in connection to autism, we created a gene dataset by extracting all autism-related genes from 5 databases (AutismKB [44], SFARI [45], Step 2-Denotes the selection of a set of variants to be used during the leave-one-out classification (LOOCV) process.
Step 3-The LOOCV is initiated by extracting one sample from the dataset.
Step 4-PLINK is used to perform odds ratio analysis on the remaining samples (this avoids overfitting). X number of variants top significant variants are used for the next.
Step 5-A linear regression model is trained using the data and features for the specific iteration.
Step 7-Steps 2-5 are repeated for every value of X HuVarBase [46], DisGeNET [47] and OpenTargets [48]), on which, using Venn Diagrams, we superimposed the results of our 2 approaches. This also allowed to identify genes that were found de novo as being implicated in autism by our study.

Functional analysis
To better understand the pathways and mechanisms involved in Severe-Mild autism classification and subphenotyping, we performed a variety of functional analyses on our gene data from both approaches. REACTOME [49] was used for the first pass identifying genes involved in various pathways. The genes not found in REACTOME were manually researched in literature and other sources like GeneCards [50], STRING [51], Uniprot [52], Mammalian phenotype Ontology [53] and Gene Ontology [54] for their functional associations.

Results
Results of the performance of the two groups according to our clinical measures on the experimental tasks are presented in Table 3 as well as between group comparisons. For the different tasks (all except picture comprehension), non-parametric tests (Mann-Whitney test) were conducted to compare performances of ASD_MH and ASD_L groups.

Cognitive measures
As expected by the inclusion criteria of the groups, in non-verbal IQ ASD_L group had worse performance (Mann-Whitney U = 0, p < 0.001). The same holds for attention total score, auditory attention, visual attention and visual range attention (Mann-Whitney U = 13, 24, 7 and 20.50, respectively, p < 0.001). In VSTM sentence and word recall the difference between the two groups was also significant (Mann-Whitney U = 11.50 and 4 respectively, p < 0.001), as well as in immediate and delayed visual memory, visual information recall, information retention factor and recognition (Mann-Whitney U = 13.50, 24, 32, 32 and 10 respectively, p < 0.001).

Language measures
In expressive vocabulary, the difference between the two clinical groups was significant as expected (Mann-Whitney U = 0, p < 0.001). In narration, the difference in groups' performance was also significant in both total elements and total sections (Mann-Whitney U = 25.50 and 32, respectively, p < 0.001).
In sum, in all measures, both cognitive and language, the ASD_MH group outperformed the ASD_L group.

Machine learning data analysis
Machine learning was performed by utilizing the severe autism (n = 15) and non-severe autism (n = 18) samples to train the linear regression model classifier described in the Material and Methods. As detailed in Fig. 1, a leave-one-out cross-validation (LOOCV) procedure was used for to assess the classifier performance. Feature, or variant selection was coupled to the LOOCV procedure to ensure an optimum set of best classifier variants is obtained. The optimum set was determined to be the top 26 variants for every LOOCV iteration and therefore these variants were selected for downstream functional analysis. To assess the performance of each feature selection run, the accuracy, specificity, sensitivity and Matthew's correlation coefficient (MCC) were calculated. Results are summarized in Fig. 3. Comparison of the optimum results attained by the molecular subtype classification defined by our risk model, with prior clinical grading, showed that they were in agreement with 81.81% (27/33 samples) prediction accuracy. Sensitivity, specificity and MCC achieved values of 73.33%, 88.89% and 0.634, respectively. Plotting receiver operating characteristic (ROC) curves resulted in area under the curve (AUC) with value 0.83. Visualization of risk model outputs for all samples using clustering algorithms (including annotation with clinical metadata) is shown in Fig. 4. The top 26 variants obtained from every LOOCV iteration were pooled together to obtain a total of 84 unique variants. Table 4 shows a list of these variants as well as their genes and full annotation including the frequency of occurrence according to the 1000 genome project (aaf_1kg_all). Full annotation for these variants including levels of heterozygosity and homozygosity and annotation in clinical database such as ClinVar. Notable molecular significant variants from the list are known to be implicated in the genetic predisposition of certain diseases and disorders including: certain cardiomayopathies (rs12063382), hypertension (rs1061157), afibrinogenemia (rs2070018), ciliary dyskinesia (rs3752042), congenital cataract (rs4682801), prostate cancer (rs1328285, rs9890913), infantile epilepsy and Parkinson's (rs56260729), mental disability and schinzel-giedion syndrome (rs12922670, rs11082414), cerebellar hypoplasia (rs77247739), kabuki syndrome   (rs5952285, rs5952682), autism (rs7049300) and even response to drug administration such as ezetimibe (rs10264715). In addition to the variant biological insights, the same thought process can be applied on the genes themselves. In total this method highlighted 60 genes, 12 related to mild and 48 to severe autism. Out of those 3 related to mild autism and 12 related to severe autism are already known and can be found in the validation dataset created from 5 databases which is described in our methodology. These results can be found in Table 1.

Literature-based approach-IGs
Using the approach previously described in our methodology, 1005 unique genes were evaluated as being homozygous to the reference alternate allele and marked in the SIFT and Polyphen databases as IGs in all our samples. Before focusing on sub-phenotypes we just pooled the IGs from children with mild and severe autism, respectively, together and validated this IGs dataset versus the 5 databases (See Methodology). In total 96 IGs of mild autism and 98 IGs of severe autism were found in the validation dataset. Out of those 70 were common between them. Investigation of the sub-phenotypical sample clusters of IQ, Memory, Attention, and Verbal, based on the clinical observations, led to identifying 14 IGs which were present in all sub-phenotypes regardless of severity (Fig. 5 in yellow background).
In samples from children with mild autism, the "IQ mild" sub-phenotype had 18 IGs common in all samples of the group, the "Memory mild" 18 IGs, the "Verbal mild" 17 IGs and the "Attention mild" 38 IGs. Also 3 IGs (CDCP2, CELSR1, OR2L8) were common between all mild sub-phenotypes. The "IQ mild", "Memory mild" and "Verbal mild" sub-phenotypes were almost identical, with the exception of OR11L1 for "IQ mild" and ZNF534 for "Memory mild" being highlighted as IGs. OR11L1 was missing as IG for "Memory mild" and "Verbal mild" but was highlighted in "Attention mild". There is also a group of 7 genes (C11orf35, DCLRE1A, IGSF10, LRRC56, MS4A14, OR56B1, OR5AU1) which were highlighted as IGs only in "Attention mild".
Samples from children with severe autism when studied per sub-phenotype highlighted 22 IGs for the "IQ severe group", 24 IGs for "Memory severe", 24 IGs for "Verbal severe" and 14 IGs for "Attention severe". The IG PTPRQ was found in all our severe autism samples. The "Memory severe' and "Verbal severe" sub-phenotype IGs were identical. In addition, the IGs OR6C65 and SALL3 were only found in these 2 sub-phenotypes. IGs KRTAP10-11, ZNF534, DNAH14, FBXW8, NIPA1, TPD52L3 were common in all severe sub-phenotypes with the exception of "Attention severe", whereas C3orf18 was found as IG only in "IQ severe", "Memory severe" and "Verbal severe". Finally, the "Attention severe" group was the only one lacking the HPS4 severity-independent IG.
Another round of validation versus the 5 database dataset was performed for these sub-phenotype IGs ( Table 2). In total 10 IGs exist in both our data and the validation dataset. Their breakdown per sub-phenotype is: NEMF was highlighted for all sub-phenotypes regardless of severity. NIPA1 was highlighted in all severe subphenotypes except for "Attention severe" in which it was found in 87% of samples, and all samples of "Attention mild". CELSR1 was validated for all mild sub-phenotypes. MS4A14 was validated only in "Attention mild". Finally, "Attention mild" was the only sub-phenotype with GALNTL5.

Functional analysis
As discussed in our methodology section, the genes highlighted by our two approaches were investigated regarding their functional role and their participation in biological processes. The results were grouped according to their function into 15 distinct categories: Developmental Biology, Nervous System Development, Synapses-Neurotransmission, Morphogenesis And Structure, Trafficking And Transport, Sensory, Cell Signaling, Cell Migration/Motility, Differentiation, Cell Cycle, Programmed Cell Death, Epigenetics, Metabolism, Post-Translational Modifications and Immunosystem.
For the genes highlighted by our machine learning approach, Fig. 6 showcases the Autism Mechanisms (AMs) implicated in severe and mild autism respectively. In total for the category Developmental Biology 7 genes in severe and only 1 in mild autism are involved. For the Nervous System Development 7 genes are involved in severe autism and 3 in mild. For Synapses-Neurotransmission 4 genes involved in severe and 1 in mild autism. For Morphogenesis and Structure 10 genes are  Breaking down the AMs brought to the foreground using our literature-based method (Fig. 7), there are 69 severity-independent AMs which span over all our categories except Epigenetics and Metabolism which appear to be severity-associated. In the severe autism group the AMs associated with IGs of "Language severe" and "IQ severe" are identical with the exception of the "Gene silencing" AM found only in the "IQ severe" due to the IG C3orf18 and in total have 36 common AMs. Children in these groups also don't appear to have AMs related to neurotransmission and cell cycle events. Likewise, the "Verbal severe" and "Memory severe" sample groupings are identical regarding their 51AMs (which is to be expected since they share the same IGs). There are no AMs associated with neurotransmission and cell cycle processes in these 2 groups, The "Attention severe" AMs are all related to the PTPRQ IG which is the only severitydependent IG in the group. PTPRQ is linked with developmental, morphogenic, sensory and signaling processes.
In our mild autism sub-phenotypical groupings only the "Memory mild" and "Verbal mild" are completely alike. These include 23 AMs in total which are associated with general and nervous system development, neurotransmission and morphogenesis. The "IQ mild" IGs are involved in 25 AMs which do not include any associated with trafficking and transport, cell cycle, epigenetic, metabolic or immunological processes. The "Attention mild" IGs are involved in 41 AMs from all our categories but do not include any epigenetic modifications. Finally, the "Language mild" AMs are the most complex category spanning across 77 AMs from all our categories including the epigenetic histone phosphorylation.
In general, for both approaches many variants found in individual genes, like AGRN (which is involved in 9 functional modules in severe ASD), ARHGEF11, NR0B1, NGEF, FOXN3, ITGF2 and MCF2, appear to be connected to a multitude of biological processes. Therefore, perturbations in any of these crucial genes, which have multiple functional involvements, may trigger the advent of disorders related to the structures and function of the CNS. It is also revealed that some genes like HEATR5A, ITGF2, KRTAP10-1 and MCF2 are involved in a single process like the morphogenesis and structure of synapses. Furthermore, we found mutational events in various proteins involved in sensory pathways which could explain the broad range of sensory abnormalities regularly observed in individuals across the autistic spectrum. Several of our findings highlighted perturbations in sensory and perceptual pathways which may explain impairments of attention, IQ, verbal ability and memory. For example, in all our severe autism samples (and none from the mild autism grouping) a common affected gene (PTPRQ) is found in all four clinical sub-phenotypes which is linked in literature to auditory impairment [55]. This gene can potentially serve as a biomarker of autism severity.
Deleterious/damaging variants in genes which encode signaling proteins can significantly alter the course of brain development, synaptic structure/ function and morphogenesis. For instance, the NLGN protein, found as significant in our results, plays an important role in synapsis and has been implicated by previous works in ASD [56]. In general, gene-encoding protein signaling is fundamental in neurodevelopment and post neurodevelopment processes such as synapse organization (AGRN, TJP1), cell migration (ACTN2, TJP1), axon guidance (ACTN2, AGRN, TNK2, ARHGEF11, NGEF, FGA) and dendrite development (AGRN), and any perturbation in processes like these may trigger the rise of disorders related to the structures and functions of the CNS. Also, BMPs, whose signaling has been shown to be dysregulated in ASD, constitute the largest subdivision of the TGF-β superfamily and are critical in the development of the CNS.

Discussion
Autism is a neurodevelopmental disorder with heterogeneous genomic and phenotypical characteristics. There is also a high hereditary factor involved in its presentation but also a discrepancy regarding the sex of patients with a known 4 to 1 male to female ratio which is also present in the current study [57]. Our design assesses rare genetic risk variations in ASD to predict symptom severity based on genetic variation but also studies the perturbed gene effects on autism-related clinical observations linking the phenotypical to the genetic variation of ASD. This link of heterogeneity could involve many types of variables. Our observations distinguish clinical severity on a variety of characteristics like IQ, memory, attention, and verbal ability with the IQ range having the highest impact. These observations are strong candidate sources of etiologic differences and were carried out in a real-life setting, enabling us to determine baseline information for the bioinformatics approaches. The latter include both a machine learning and a literature-based technique, in order to validate and substantially extend current knowledge on ASD phenotypical severity but also individual characteristics.
Current criteria for clinical classification of ASD individuals provide a valuable behavioral depiction of the disorder but often fall short when grading them into severe vs. non-severe target groups. In this study, we use a WES dataset of 33 children with ASD. Given the small sample size of our dataset we focus more on the potential of these methodologies for disorder classification and the novelty of some of the identified genes. Using a machine-learning algorithm, based on linear regression polygenic risk score assessment, we select informative genes with the potential to contribute toward the grading of ASD. We obtain specific molecular signatures from severe and non-severe subtypes of autistic samples and show that these molecular signatures have the potential to define prognostic subclasses. Further experimentation is required before the role of the genes and variants can be deemed diagnostic. These subclasses involve in total 48 genes linked to severe ASD out of which only 12 have been previously identified and 12 genes linked to mild ASD with only 3 represented in current databases. We further show that 28/84 identified variants were found to underline specific functional classes related to autism and intellectual disability, as well as other disorders. Polygene risk score grading of samples using top variants is in agreement to prior clinical grading for our dataset with 81.81% prediction accuracy; thus, showing that our model is capable of recapitulating the clinical diagnostic methodology employed for this small number of children. However, we further show that six samples were particularly challenging to diagnose molecularly. This can be attributed to a variety of reasons, including methodological or biological factors which can always contribute to variation during experiments. Moreover, as seen in Fig. 5 we observe that there is a specific genetic profile that extends to all children with ASD in the severe subtype for IQ, memory and verbal skills (and in the vast majority of children for attention skills) as well as to all children in the mild subtype for attention. These results are in line with our phenotypical data for visual and auditory attention skills and in accordance with the literature, where both visual and auditory attention disorders have been found for mild and severe phenotypes in ASD [58][59][60]. These results also suggest that a high percentage of children present Attention Deficit Hyperactivity Disorder (ADHD) symptoms which might be due to either comorbidity or due to a common underlying factor [61].
In addition, this work employs a linear regression machine-learning model grading of these samples into molecular subtypes. Our approach allows for the extraction of genomic signatures from the bipartite (severe vs. non-severe) autistic classification scheme used to train our risk model. These signatures show differences in prognosis when compared to clinical grading showing valuable additive information that is impossible to obtain form clinical diagnosis alone. We envision that the identification of the novel set of variants and genes underlying these molecular signatures will enable autistic diagnosis to progress toward a more quantitative realm, where individuals with ASD are viewed within an autistic spectrum rather than the categorical grouping into distinct subtypes. We emphasize that the interpretation of classification results based on genomic data must be accompanied by clinical annotation on as many levels as possible. Only by the integration of such work with expert clinical and pathological annotation, can we maximize the value of genotypic data, increase our understanding of autistic pathology, and further develop current diagnostic and therapeutic approaches.
To further extend this last point, we employed a literature-based approach taking into consideration known effects of genetic variations on the translated proteins. This highlighted 14 (13 novel and 1 known) autismrelated genes, independent of phenotypical severity, which contain various severe protein changes. We believe that these genes constitute a genetic background which has the potential to characterize children with ASD. Clusters of novel genes carrying "deleterious/damaging" variants are found to signify different degrees of severity. Furthermore, a plethora of genes have been linked to specific disorder manifestations (sub-phenotypes) namely IQ, memory, attention and verbal impairments and can help elucidate the symptomatology of ASD's severity. These genes have also allowed for ascertaining specific groupings of sub-phenotypes based on their common genetic signatures, like the memory and verbal traits which seem to have identical IGs in their severe state.
Both previously described bioinformatics approaches have allowed further function dissection and highlighting of important biological processes which bridge the gap between genotype and phenotype in autism. By analyzing the molecular profile based on the clinical severity of ASD as a whole and its individual core features, we identified several potential molecular signatures of disorder and symptom severity. Despite the complex architecture of mutational events associated with autism, the various proteins involved, appear to converge on common processes such as synaptic function, brain development, chromatin remodelers (epigenetics processes), cell life processes, morphology/structures and function, sensory and signaling pathways. The autism-related core features which arise from underlying vulnerabilities are related to pleiotropic genes which associate with important molecular mechanisms.
Several observations regarding gene participation in specific and multiple biological processes were made, covering a wide range of functions.
Macroscopically, there is a high diversity of pathway groupings in our results. It is important to underline that there are considerably more genes perturbed in biological processes which relate to the CNS and neurodevelopment when examining the severe side of autism. There is also strong indication of higher variant occurrence in severe ASD where we observe that in CNS development there are 8 genes (ACTN2, AGRN, TNK2, ARHGEF11, NGEF, KDM6A, AGRN, NR0B1) highlighted in severe cases and only 4 (TJP1, ZNF131, NCOA6, FGA) in mild, in synapsis and neurotransmission 4 (AGRN, ACTN2, NLGN4X, NAT2) in severe versus only 1 (TJP1) in mild among other examples. Our findings also support the idea that ASD-associated genes may contribute not only to core characteristics of ASD but also potentially enhance vulnerability to other systemic problems including metabolic conditions, immune system dysfunction etc. something that is also previously described in literature [62].
We hereby acknowledge that our study has some limitations. The sample is small and thus our classification approach is mostly centered on the prediction and verification of these genes and does not hold a diagnostic nature per se. Moreover, the genes identified should be treated with caution again due to the limited sample size of our dataset.

Conclusions
In conclusion, this study utilizes machine learning classification tools to obtain novel genes implicated in mild/ moderate or severe ASD symptoms by constructing SNPbased classification models with 82% prediction. Our de novo implicated ASD risk genes appear to provide a substantial extension of previously reported genes, enriching current ASD-gene and variant databases. These risk genes can potentially be used to distinguish children with different degrees of ASD symptom severity, however substantial further experimentation is required to fully validate their diagnostic capacity. We also provide further clarification of the relationship between ASD risk mutations and intellectual disability [low on intelligence quotient-(IQ)], and impairment in memory, verbal disturbances and attention deficits. We believe that this study will help bridge the genotype-to-phenotype gap in ASD, illuminating how genetic variation can drive the severity of the disorder and/or specific pathological traits exhibited by individuals with ASD. By predicting the disorder's severity genetically, children with ASD could receive more targeted care.